LLM Benchmark Improvements + More Evals by bradleyshep · Pull Request #4740 · clockworklabs/SpacetimeDB

bradleyshep · 2026-04-01T15:13:12Z

Description of Changes

LLM benchmark infrastructure improvements and new benchmark tasks.

Runner & scoring:

Add retry logic with backoff for LLM API calls (rate limits, 502/503/504, timeouts)
Fix generation_duration_ms to time only the successful attempt, excluding retries and sleep delays
Add --dry-run flag to run benchmarks without saving results
Add OpenRouter client as unified fallback when direct vendor keys aren't set
Add web search mode via OpenRouter :online suffix
Extract shared OpenAI-compatible response types into oa_compat.rs
Add ReducerCallBothScorer for calling reducers on both golden and LLM databases
Set max_tokens on OpenRouter and Meta clients to prevent silent truncation
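
The retry and timing items above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `CallError`, `Timed`, and `call_with_retry` are hypothetical names, and the exponential backoff schedule is an assumption (the description only says "backoff"). The point it shows is that the clock restarts on every attempt, so `generation_duration_ms` excludes failed attempts and sleeps.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error type: only some failures are worth retrying.
#[derive(Debug, Clone, PartialEq)]
pub enum CallError {
    /// Rate limits, 502/503/504, timeouts.
    Transient(String),
    /// Anything else (bad request, auth failure, ...).
    Fatal(String),
}

/// A successful value plus the duration of the successful attempt only.
#[derive(Debug)]
pub struct Timed<T> {
    pub value: T,
    pub generation_duration_ms: u128,
    pub attempts: u32,
}

/// Calls `call` up to `max_attempts` times, sleeping between transient failures.
pub fn call_with_retry<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut call: impl FnMut() -> Result<T, CallError>,
) -> Result<Timed<T>, CallError> {
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        // Restart the clock per attempt so earlier failures and sleeps are excluded.
        let started = Instant::now();
        match call() {
            Ok(value) => {
                return Ok(Timed {
                    value,
                    generation_duration_ms: started.elapsed().as_millis(),
                    attempts: attempt,
                });
            }
            Err(CallError::Transient(_)) if attempt < max_attempts => {
                // Assumed schedule: base, 2*base, 4*base, ...
                let factor = 2u32.saturating_pow(attempt - 1);
                sleep(base_delay.saturating_mul(factor));
            }
            // Fatal errors, or transient errors on the last attempt, are returned as-is.
            Err(e) => return Err(e),
        }
    }
}
```

In a real client the closure would wrap the HTTP request and map status codes such as 429 or 503 to `CallError::Transient`.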

Model routing:

Add ModelRoute with display name, vendor, API model, and OpenRouter model ID
Support ad-hoc model IDs via --models vendor:model without static registration
Add model name normalization (OpenRouter IDs, case variants → canonical display names)
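
As a rough sketch of the routing items above: the `ModelRoute` fields follow the list, but `parse_ad_hoc`, `normalize_model_name`, and the choice of a lowercase canonical form are assumptions made for illustration. The actual canonical display names used by the tool are not shown in this description.

```rust
/// Fields follow the PR description; the exact types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelRoute {
    pub display_name: String,
    pub vendor: String,
    pub api_model: String,
    /// Left unset for ad-hoc routes in this sketch.
    pub openrouter_model: Option<String>,
}

/// Parses an ad-hoc `vendor:model` spec (hypothetical helper).
pub fn parse_ad_hoc(spec: &str) -> Option<ModelRoute> {
    // Split on the first ':' only, so suffixes like ":online" stay with the model.
    let (vendor, model) = spec.split_once(':')?;
    let (vendor, model) = (vendor.trim(), model.trim());
    if vendor.is_empty() || model.is_empty() {
        return None;
    }
    Some(ModelRoute {
        display_name: model.to_string(),
        vendor: vendor.to_ascii_lowercase(),
        api_model: model.to_string(),
        openrouter_model: None,
    })
}

/// Maps OpenRouter-style IDs and case variants to one key (assumed: lowercase).
pub fn normalize_model_name(raw: &str) -> String {
    // Drop a leading "vendor/" prefix, if any.
    let name = raw.rsplit('/').next().unwrap_or(raw);
    // Drop the web-search suffix, if any.
    let name = name.strip_suffix(":online").unwrap_or(name);
    name.to_ascii_lowercase()
}
```

With this sketch, `--models openai:gpt-5-mini` would yield a route with vendor `openai` and API model `gpt-5-mini` without any static registration.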

Context modes:

Add guidelines, cursor_rules, search, no_context modes with is_empty_context_mode() helper
Add mode-specific prompt preambles
Consolidate mode alias normalization (none/no_guidelines → no_context)
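
The alias consolidation above might look something like the following. `normalize_mode` is a hypothetical name, and this version of `is_empty_context_mode` assumes that only `no_context` (and its aliases) counts as empty; the real helper may treat other modes, such as `search`, the same way.

```rust
/// Collapses mode aliases onto canonical names (hypothetical helper).
pub fn normalize_mode(raw: &str) -> String {
    match raw.trim().to_ascii_lowercase().as_str() {
        // Aliases listed in the PR description.
        "none" | "no_guidelines" | "no_context" => "no_context".to_string(),
        other => other.to_string(),
    }
}

/// Assumption: a mode is "empty" when it normalizes to `no_context`.
pub fn is_empty_context_mode(mode: &str) -> bool {
    normalize_mode(mode) == "no_context"
}
```

Normalizing once at the CLI boundary keeps the rest of the runner free of alias checks.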

CI workflows:

Add llm-benchmark-periodic.yml for scheduled nightly runs with per-language failure tracking
- Note: The periodic workflow requires OPENROUTER_API_KEY, LLM_BENCHMARK_UPLOAD_URL, and LLM_BENCHMARK_API_KEY as GitHub secrets.
Add llm-benchmark-validate-goldens.yml for validating golden answers still compile

Results & summary:

Add cmd_status to show incomplete benchmark combinations with rerun commands
Add cmd_analyze for LLM-powered failure analysis
Split normalize_details_file from write_summary_from_details_file
Derive task categories from filesystem for summary generation
Add timestamp tracking (started_at/finished_at) and token usage

New benchmark tasks:

30 new tasks across auth, data_modeling, queries, basics, and schema categories
Updated/fixed existing task prompts and golden answers

API and ABI breaking changes

None. Internal tooling only.

Expected complexity level and risk

2 — Changes are scoped to the LLM benchmark CLI tool (xtask-llm-benchmark) and CI workflows. No impact on SpacetimeDB core.

Testing

cargo check -p xtask-llm-benchmark — zero errors, zero warnings
Dry run: llm_benchmark run --lang typescript --modes no_context --tasks t_001 --models openai:gpt-5-mini --dry-run — ran end-to-end, confirmed no results saved to disk
Verify periodic workflow runs successfully on next scheduled trigger

bradleyshep · 2026-04-15T20:16:38Z

@cloutiertyler

UPDATES:

JSON elimination:

Deleted all JSON result/summary files from docs/llms/ (8 files, ~168K lines)
Removed all JSON file I/O code: merge_task_runs, load_results, save_atomic, normalize_details_file, write_summary_from_details_file, update_golden_answers_on_disk, load_summary
Removed Summary, LangSummary, ModeSummary, ModelSummary, CategorySummary, Totals, GoldenAnswer types
Removed fs2 and tempfile dependencies
Deleted entire results/ module

New API client (src/api/):

upload_batch() — POSTs results + per-model AI analysis to /api/llm-benchmark-upload
upload_task_catalog() — POSTs the task catalog with per-language prompts and golden answers to /api/llm-benchmark-tasks
fetch_run_dates() / fetch_failures() / upload_analysis() — for the analyze command

AI analysis (src/bench/analysis.rs):

Builds analysis prompts from failures + golden answers read from disk
Mode-aware: adjusts context description, includes fix section only for modes with documentation context
Consistent structured template across all analysis paths

CLI simplification:

Reduced from 9 commands to 2: run and analyze
run — runs benchmarks, uploads results + task catalog + analysis to DB, auto-retries provider failures
analyze — fetches failures from DB for a specific run date, generates analysis per (lang, mode, model), uploads or saves locally with --dry-run

Local output (--dry-run):

--dry-run saves results to runs/run-{id}.json
--dry-run --local-analysis additionally generates per-lang/mode/model markdown analysis files under runs/, sharing the same run id
Non-dry runs with no upload client skip analysis entirely instead of spending tokens and discarding the result

Provider failure handling:

Auto-retries tasks where the LLM never responded (timeouts, 429s, 502/503/504) for up to 3 rounds with a 30s delay
Stops early if no tasks recover in a round (provider likely down)
Unrecovered tasks are retained as recorded RunOutcome failures instead of being dropped — runs keep one result per selected task
Retry bookkeeping uses PendingRetry struct to track last error per task
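
The round-based retry described above can be sketched as below. Only the names `PendingRetry` and `RunOutcome` come from the description; their fields, the `retry_rounds` function, and the injected `wait` callback (standing in for the 30s delay so the logic stays testable) are assumptions.

```rust
/// Tracks the last provider error per task (fields are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct PendingRetry {
    pub task: String,
    pub last_error: String,
}

/// Simplified stand-in for the real outcome type.
#[derive(Debug, Clone, PartialEq)]
pub enum RunOutcome {
    Completed { task: String },
    ProviderFailure { task: String, error: String },
}

/// Retries pending tasks for up to `max_rounds`, stopping early when a round recovers nothing.
pub fn retry_rounds(
    mut pending: Vec<PendingRetry>,
    max_rounds: u32,
    mut attempt: impl FnMut(&str) -> Result<(), String>,
    mut wait: impl FnMut(),
) -> Vec<RunOutcome> {
    let mut outcomes = Vec::new();
    for _ in 0..max_rounds {
        if pending.is_empty() {
            break;
        }
        wait(); // e.g. sleep 30s in a real run
        let mut still_pending = Vec::new();
        let mut recovered = 0usize;
        for mut p in pending {
            match attempt(&p.task) {
                Ok(()) => {
                    recovered += 1;
                    outcomes.push(RunOutcome::Completed { task: p.task });
                }
                Err(e) => {
                    p.last_error = e;
                    still_pending.push(p);
                }
            }
        }
        pending = still_pending;
        if recovered == 0 {
            break; // provider is likely down
        }
    }
    // Unrecovered tasks are kept as failures so every selected task has one result.
    outcomes.extend(pending.into_iter().map(|p| RunOutcome::ProviderFailure {
        task: p.task,
        error: p.last_error,
    }));
    outcomes
}
```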

CI workflows:

Cleaned up llm-benchmark-periodic.yml: removed the --upload-url flag, summary generation, and JSON commit/push steps; added permissions: contents: read
Deleted llm-benchmark-update.yml entirely
Removed llm_ci_check job from ci.yml

File structure cleanup:

Renamed bench/results_merge.rs → bench/normalize.rs
Moved Results/LangEntry/ModeEntry/ModelEntry from results/schema.rs into bench/types.rs
Deleted results/ module entirely
Fixed UTF-8 truncation panic in analysis prompt builder (byte-index → char-index)
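
For context on the UTF-8 fix above: slicing a `&str` at a byte offset that falls inside a multi-byte character panics, for example `&"héllo"[..2]`. A minimal sketch of char-based truncation follows; `truncate_chars` is a hypothetical name, not necessarily the helper used in the analysis prompt builder.

```rust
/// Returns at most `max_chars` characters of `s`, always cutting on a char boundary.
pub fn truncate_chars(s: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset where each character starts,
    // so the offset of the (max_chars + 1)-th character is a safe cut point.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // shorter than the limit: return unchanged
    }
}
```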

Context mode changes:

guidelines mode now reads from the skills/ directory (concepts/SKILL.md + {lang}-server/SKILL.md) instead of the old docs/static/ai-guidelines/
Deleted docs/static/ai-guidelines/ directory (replaced by skills files)
Removed all cursor_rules mode references (directory was deleted)
Added skills_dir() helper in constants
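
As an illustration of the skills layout above, the two files read in guidelines mode could be resolved like this. `guideline_files` is a hypothetical function, and the `{lang}` value is assumed to be the same language name passed to `--lang`; the real `skills_dir()` helper and its return type are not shown in this description.

```rust
use std::path::{Path, PathBuf};

/// Resolves the shared concepts file and the per-language server file.
pub fn guideline_files(skills_dir: &Path, lang: &str) -> [PathBuf; 2] {
    [
        // Language-independent concepts.
        skills_dir.join("concepts").join("SKILL.md"),
        // Language-specific server guidance, e.g. "typescript-server".
        skills_dir.join(format!("{lang}-server")).join("SKILL.md"),
    ]
}
```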

Prompt normalization:

Stripped UTF-8 BOM from 9 benchmark prompt files
Normalized TypeScript prompt first lines to match Rust/C# wording across all 19 divergent tasks
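
A UTF-8 BOM is the single code point U+FEFF at the start of a file. Stripping it when loading a prompt can be done as in this small sketch, where `strip_bom` is a hypothetical helper (the PR itself removed the BOM from the files on disk):

```rust
/// Removes a leading UTF-8 byte order mark, if present.
pub fn strip_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}
```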

Other:

Merged bradley/llm-single-source-of-truth branch
Added tools/xtask-llm-benchmark/.gitignore for runs/ directory

# Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Upgraded through 12 feature levels, manually graded at each level, bugs fixed, all costs measured via OpenTelemetry.

Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade

## Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`, `--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session telemetry into cost reports.
- Backend guidelines in `backends/`: SpacetimeDB SDK reference, config templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io guidance.

**After clockworklabs#4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS**

## Two complete benchmark runs

**Run 1 (20260403):** Original methodology.

**Run 2 (20260406):** Refined methodology with domain bias removed from SpacetimeDB SDK docs and PostgreSQL instructions made feature-spec-neutral.

**Note: no meaningful changes in results were observed with these changes. Domain familiarity biases were very small and almost certainly not the cause of STDB's major gains over PG stack.**

Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.

## 12 feature levels

| Level | Feature |
|---|---|
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

## Results

| | Run 1 (20260403) | Run 2 (20260406) |
|---|---|---|
| **SpacetimeDB total cost** | $13.33 | $12.62 |
| **PostgreSQL total cost** | $17.80 | $19.68 |
| **SpacetimeDB bugs** | 5 | 2 |
| **PostgreSQL bugs** | 19 | 8 |
| **SpacetimeDB fix sessions** | 4 | 1 |
| **PostgreSQL fix sessions** | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and **confirmed the advantage is structural, not an artifact of domain-biased SDK docs.**

## Performance benchmark (`perf-benchmark/`)

Stress throughput tool that fires concurrent writers at peak saturation against the AI-generated `send_message` handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
|---|---|---|---|
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database). Optimized reference code with all features preserved is in `perf-benchmark/results/optimized-reference/`.

## Data handling

Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`, `metadata.json`) are committed. Raw OTel telemetry (`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and stored privately.

# API and ABI breaking changes

None. All changes are in `tools/llm-sequential-upgrade/`. No production code, library, or SDK changes.

# Expected complexity level and risk

**1 - Trivial.** Self-contained benchmarking tooling and data. No interaction with production code.

# Testing

- [x] L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with OTel cost capture
- [x] All levels manually graded after each upgrade; bugs filed and fixed via the harness
- [x] Methodology refinement between runs validated (domain bias removal, feature-neutral instructions)
- [x] Stress benchmarks run across both runs x 3 tiers (as-shipped, rate-limit-removed, optimized)
- [x] Optimized benchmarks verified to preserve all original features
- [x] Sensitive data (PII in raw telemetry) removed from repo and gitignored
- [ ] Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json numbers match the telemetry cost-summary.json files

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>

bradleyshep added 30 commits March 23, 2026 12:08

open router

36a875a

no guidelines variant, new workflows, results save updates

3b88747

new evals batch one

76016e7

query evals

3bdecca

more evals + categories

b3ce8f7

fixes

52e28b9

fixes

617e052

fmt

6eb1168

llm benchmark site

b9a545f

Create ModelDetail.tsx

afee2e0

site + details

61d815e

benchmark site + run

e132ed8

more evals + fixes

56e693f

fixes

1216af6

refinements

ec966f9

updates

00d6598

updates; guidelines mode

850254e

Create README.md

4abe096

fixes

b432278

updates

bed39d0

remove tools/site

9ffba0b

normalize model names

bb26681

scoring fixes

139408e

fixes

6f740ed

results

dd35c66

rust concurrency and details updates

db0e185

Update spacetimedb-typescript.mdc

25a246e

update actions

741fcf4

Merge branch 'master' into bradley/llm-benchmarks-improvements

603b5ee

Update llm-benchmark-periodic.yml

5fd1a0e

bradleyshep added 5 commits April 15, 2026 09:44

remove guidelines -> use skills

7f954b1

cleanup

6fa6418

prompt normalization

b0436d0

Merge branch 'master' into bradley/llm-benchmarks-improvements

7a77779

dry run + local analysis

61a8d86

github-advanced-security AI found potential problems Apr 15, 2026

bradleyshep added 6 commits April 15, 2026 14:47

Update runner.rs

4434307

results

40ef9e8

Merge branch 'bradley/llm-benchmarks-improvements' of https://github.…

8a2b485

…com/clockworklabs/SpacetimeDB into bradley/llm-benchmarks-improvements

Revert "results"

b9e12f3

This reverts commit 40ef9e8.

delete

f794ffe

Merge branch 'bradley/llm-benchmarks-improvements' of https://github.…

58317f5

…com/clockworklabs/SpacetimeDB into bradley/llm-benchmarks-improvements

bradleyshep added 2 commits April 15, 2026 16:33

;omts

595aa7a

Analyze run of given date support

2a156d3

bradleyshep mentioned this pull request Apr 16, 2026

LLM Benchmark: Sequential Upgrades Test #4817

Merged

7 tasks

clockwork-labs-bot enabled auto-merge April 18, 2026 21:03

bradleyshep added 2 commits April 20, 2026 09:20

SKILLS: randomness + some cleanup

474c448

Update SKILL.md

d74b2a9

clockwork-labs-bot disabled auto-merge April 24, 2026 14:53

cloutiertyler approved these changes Apr 28, 2026

Merge branch 'master' into bradley/llm-benchmarks-improvements

e1b2224

cloutiertyler enabled auto-merge April 28, 2026 16:37

Merge branch 'master' into bradley/llm-benchmarks-improvements

908f439

cloutiertyler added this pull request to the merge queue May 11, 2026

github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks May 11, 2026

bfops added this pull request to the merge queue May 11, 2026

Merged via the queue into master with commit be86a51 May 11, 2026
30 checks passed

bfops merged 86 commits into master from bradley/llm-benchmarks-improvements

4 participants
